Experimental Evaluation of Feature Selection Methods for Clustering

Authors

  • Martin Azizyan
  • Aarti Singh
  • Wei Wu
Abstract

Background. Recently, several clustering methods have been proposed that perform feature selection with the goal of finding structure in high-dimensional data that is confined to a small set of features. Most of these methods are stated in terms of non-convex optimization objectives and are computationally intractable to solve exactly in general. In this project we consider two such methods, both stated as modifications of the well-known K–means objective. The approximation algorithms proposed for these methods require initialization with some clustering, which is taken to be the result of standard K–means.

Aim. Our goal is to experimentally characterize the behavior of these algorithms: the types of structure they can and cannot discover, how they differ in performance, and what effect the approximation methods have on the clustering and feature selection results.

Data. We use a dataset containing a detailed phenotypic characterization of 378 participants in a lung disease study in terms of 112 demographic, environmental, and medical features. The variable types are mixed, including continuous, ordinal, and nominal features.

Methods. Synthetic datasets are designed to elucidate the performance and behavior of the methods under consideration, both in terms of the optimal solutions of the optimization problems and the quality of the solutions obtained from the approximation algorithms. We formulate greedy K–means, a new feature selection method for clustering inspired by the two methods under consideration. We propose several initialization methods as alternatives to K–means initialization. We also propose a clustering criterion for mixed variable types; this criterion can be directly adapted to perform feature selection with the same approach used in greedy K–means.

Results. In our synthetic experiments, the proposed greedy K–means method never performs much worse than the other two, and it may be advantageous for interpretability because it gives exactly sparse solutions. We show examples where the previously proposed initialization method, using the K–means solution, performs arbitrarily poorly, so that alternate initialization is needed. Among the initialization methods we consider, the best-performing one is random-support-based initialization, to which greedy K–means lends itself naturally. We show that the result of applying these methods to the lung disease data is unstable in terms of the resulting clustering, and that the selected features are strongly influenced by correlations between variables; further investigation is warranted to verify that the clustering methods are in fact finding significant non-linear structure in the data.

Conclusions. We provide insights regarding the challenges presented by noisy, high-dimensional data when applying the feature-sparse clustering methods under consideration. We demonstrate the practical importance of not ignoring the approximate nature of the optimization procedures available for these methods. Since linear structure (i.e., correlations) is overwhelmingly likely to exist in real-world datasets, our results indicate that care must be taken regarding the type of structure discovered when interpreting the meaning of selected features.
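The abstract does not spell out the greedy K–means procedure. Purely as an illustration of the general idea of greedy forward feature selection wrapped around a K–means objective, here is a minimal Python sketch; the function names, the standardization step, and the WCSS-to-total-variance scoring rule are our assumptions, not the authors' method:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns labels and within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)   # (n, k) squared distances
        labels = d.argmin(1)
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    wcss = ((X - centers[labels]) ** 2).sum()
    return labels, wcss

def greedy_kmeans(X, k, n_features):
    """Hypothetical greedy forward selection: repeatedly add the feature whose
    inclusion yields the lowest WCSS-to-total-variance ratio on the selected support."""
    X = (X - X.mean(0)) / X.std(0)          # standardize so no feature dominates by scale
    support, remaining = [], list(range(X.shape[1]))
    while len(support) < n_features:
        best, best_score = None, np.inf
        for f in remaining:
            cols = support + [f]
            _, wcss = kmeans(X[:, cols], k)
            score = wcss / (X[:, cols] ** 2).sum()   # fraction of variance left unexplained
            if score < best_score:
                best, best_score = f, score
        support.append(best)
        remaining.remove(best)
    labels, _ = kmeans(X[:, support], k)
    return support, labels
```

On data where only one feature carries cluster structure (e.g. a mean shift in feature 0 with the rest pure noise), a scheme like this selects the informative feature first, which mirrors the "exactly sparse solutions" property the abstract attributes to greedy K–means.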


Similar articles

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely search-based and non-search-based procedures, as well as evaluation criteria and data mining tasks, are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...



Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Feature selection can be decisive when analyzing high-dimensional data, especially with a small number of samples. Feature extraction methods do not perform well under these conditions. With small sample sets and high-dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...


Modeling and design of a diagnostic and screening algorithm based on hybrid feature selection-enabled linear support vector machine classification

Background: In the current study, a hybrid feature selection approach involving filter and wrapper methods is applied to several bioscience databases with various records, attributes, and classes; hence, this strategy enjoys the advantages of both methods, such as fast execution, generality, and accuracy. The purpose is to diagnose disease status and estimate patient survival. Method...


The Usefulness of Aggregated Regressions and Optimal Predictor Variable Selection Methods in Predicting Stock Returns

This paper examines the usefulness of aggregated regressions and optimal predictor variable selection methods (including correlation-based and Relief methods) for predicting the stock returns of companies listed on the Tehran Stock Exchange. To assess the performance of aggregated regression, its prediction evaluation criteria (including mean absolute percentage error, root mean squared error, and the coefficient of determination) are compared with those of linear regression and artificial neural networks...


MLIFT: Enhancing Multi-label Classifier with Ensemble Feature Selection

Multi-label classification has gained significant attention in recent years, due to the increasing number of modern applications involving multi-label data. Despite its short history, different approaches have been presented to solve the task of multi-label classification. LIFT is a multi-label classifier which utilizes a new strategy for multi-label learning by leveraging label-specific ...



Journal title:

Volume   Issue 

Pages  -

Publication date: 2014